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Abstract 

We consider the estimation of regression models on strata defined using a categorical covariate, in 
order to identify interactions between this categorical covariate and the other predictors. A basic 
approach requires the choice of a reference stratum. We show that the performance of a penalized 
version of this approach depends on this arbitrary choice. We propose a rehned approach that by¬ 
passes this arbitrary choice, at almost no additional computational cost. Regarding model selection 
consistency, our proposal mimics the strategy based on an optimal and covariate-specihc choice for 
the reference stratum. Results from an empirical study conhrm that our proposal generally outper¬ 
forms the basic approach in the identihcation and description of the interactions. An illustration is 
provided on gene expression data. 


1. Introduction 


We consider the estimation of regression models when the population under study is stratified, as is 
standard in epidemiology and clinical research. For instance, when studying relapse after a primary 
breast cancer, it is now common to analyze various histological subtypes at once (|Voduc et al. 


2010 Rosner et al. 20131. In order to accurately estimate the risk of relapse according to cancer 


subtype, risk factors that interact with cancer subtype need to be identified, and the corresponding 
interactions need to be precisely described. In pharmacokinetics, it is often of interest to describe 


how parameters related to absorption and clearance depend on dosage and type of adjuvant. In Ollier 


et al. (20161, for example, non-linear mixed effect models are estimated on strata defined according 
to the treatment dosage and the type of adjuvant. In Section]^ we study the association between the 
expression of the epidermal growth factor receptor gene, and those of 44 other transcription factors 


at eight time points of a differentiation process. We use a data set described in Kouno et al. (20131 


where expression of these factors has been measured by profiling, for each time point, 120 different 
single cells. In this application, our aim is to describe how the association between the epidermal 
growth factor receptor gene and the other 44 transcription factors varies over these eight strata. 

In all these examples, the general objective is to study the relationship between a response 
variable y G IR and a vector of p > 1 predictors x = (xi,..., Xp) G IR^ over K >1 strata defined 
according to a categorical covariate Z of primary interest, such as cancer subtype. A key objective 
is to determine how Z modifies the effect of x on y, that is to identify and describe interactions 
between predictors x and the categorical effect modifier Z ( Gerfheiss & Tutz[ 2012| ). Identifying 
and taking account of these interactions is important even when the assessment of categorical effect 
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modification is not the main objective. In particular, estimating one model per stratum or one model 
on all the strata pooled together generally leads to overfitting or underfitting, respectively. 

For simplicity, we mostly focus on linear regression models. Denote by = (/3^ ^ Pkp) ^ 

IRP, for any k G [K] = {1,..., iF}, the parameter vector describing the association between y and 
X in the model corresponding to stratum Z = k. For any y G [p], denote by dj G [K] the number of 
distinct values in the set {/3i • • • :/3k j}- Further consider the partition of [K] 

such that j = ■ if and only if ^ some d G [dj]. The full description of 

how Z modifies the effect of Xj on y relies on the identification of this partition. The total number of 
partitions of [K] is the Kth Bell number Bk, with, for instance, = 52. Standard statistical test 
procedures would require the comparison of {BkY models for the identification of the p partitions, 
so they are not well-suited to fully describe how Z modifies fhe effect of x on y. 

Penalized approaches have been advocated in this context, which can be seen as a special case of 


multi-task learning (Evgeniou & Pontil 20041. In particular, a version of the adaptive generalized 
fused lasso ( Tibshirani et al.[[2005 Viallon et al.[ 20161 has been shown to enjoy an asymptotic 
oracle property if Kp does not grow with the sample size n ([Gertheiss & Tutz[ |2012 Oelker et al. 


2014 1 . If Kp is fixed then this method identifies partitions ..., for all j G [p] with 

probability tending to one as n —)• oo, under mild assumptions. However, theoretical results in a 


non-asymptotic framework are still lacking for this method. Results obtained by Sharpnack et al. 


(20121 for generalized fused lasso estimates in the Gaussian means setting are not straightforward to 
extend to our context but they suggest that this method might only be able to identify the partitions 
under very particular settings. See Section [2^ for more details. 

An alternative strategy, which is standard in epidemiology, consists in first selecting a reference 
stratum, say the first one, then coding the other strata by indicators 1(Z = k), and finally including 
them into the regression model, along with their interactions with the predictors. This corresponds 
to considering the decompositions + 6^, for all k G [K], with (5j[' = Op, the null vector in 


IRP. Then, the lasso (Tibshirani 19961 can be used to identify null components in vectors fdf and 


6^, for A: / 1. Of course, this strategy can generally identify only one element of each partition 


{/C 


( 1 ) 




}. However, a natural question is whether it is sparsistent, that is whether it can 


identify the sets = {j G [p] : Y 0} and = {{k,j) G [K] x [p] : Y fdlj} with 

high probability. If so, it could partially describe the effect of Z on the association between y and x, 
while results of |Sharpnack et ah ( 2012| ) suggest that the method based on the fused lasso generally 
fails to fully describe it. 

In this article, we establish that the sparsistency of this basic approach depends on the choice 
of the reference stratum. We then present a refined approach which bypasses this arbitrary choice. 
Under linear regression models, we show that our proposal is sparsistent under conditions similar to 
those ensuring the sparsistency of the approach based on an optimal and covariate-specific choice of 
the reference stratum. In addition, our proposal can be implemented with available packages under 
a variety of models, at approximately the same computational cost as that of the basic approach. 


2. Methods 

2.1 Notations and setting 

Methods are first presented in the linear regression model for ease of notation. Extensions to gener¬ 
alized linear models ([McCullagh & Nelder 19891 are briefly presented in Section 2.4 
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For any positive integer m > 1, define [m] = {1,..., m}. Let Om and be the vectors of size 
m with components all equal to 0 and 1 respectively, and let Im be for the (m x m) identity matrix. 
For any vector x = (xi,..., Xm)^ £ IR™, let supp(x) = {j G [m] : xj / 0} denote its support. 
We further set ||x||g = |xj|'^)^/'^ for any real number q G (0, oo), ||x||oo = maxj \xj\ and 

||x||o = |supp(x)|, where \E\ is the cardinality of the set E. For any set E C [m], let xg denote the 
vector of 1RI®I with components {xj)j^E- For any real matrix M, let Mj be its jth column, and let 
Me be the sub-matrix made of columns {Mj)j^E- Further denote the smallest singular value of M 
by Amin(M). Finally, ]!(•) is the indicator function. 

Denote the number of levels of variable Z, that is the number of strata, by iF > 1. Let be 

the number of observations in stratum k G [K], and denote the total number of observations by 
n = Ylke[K] ^ ^ further denote the response vector in stratum k by y{k) g 

Similarly, let be the {uk x p) design matrix in stratum k. For all k G [K], we assume that 
yik) — with noise vector ..., G IR"'=. Vectors (31 G IR^ include 

the Kp parameters to be estimated. 


2.2 Basic approach 


A basic approach consists in picking a reference stratum for any j, say £j. Most often in practice, 
ij is chosen so that it does not depend on j ; here we consider the most general version. Introduce 
£ = {£i,...,£p) G [K]P, p} = • • •, I3}^^p) and 51 = (31 - p}. The basic approach relies 

on the decomposition (^l = p} + 5^, for all k G [K], with (5|, j = 0 (Gertheiss & Tutz 
Following the lasso (Tibshirani 19961, estimates of p^ and 5(., k G [K], can be defined as 


20121 . 


argmm 

j=0 for all j S [p] 
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E 

. k=l 


\\y{k) _xik)(p + Sk)\\l 
2n 


K 


+ Ai||/i||i + ^ A2,fc||(5fc||] 


( 1 ) 


k=l 


for appropriate non-negative Ai and A 2 fc. As will be seen in Sections 2.5 and 2.6 conditions en 


suring the sparsistency of this approach depend on the arbitrary choice of the vector of reference 
strata £. As a matter of fact, the underlying model dimension is \\p}\\o + ll/^fe “ F|IIo> which 

depends on £. The lowest dimension, and then the best possible performance, is attained if p*^ is 
a mode of the collection of values j3\ ^ , j3*j^ ■), mode(0, ..., for all j G \p]. For 

uny j G [p], such an optimal and covariate-specific reference stratum will be denoted by i* below. 
Because £* = ... ,£*) is generally unknown, the corresponding optimal version of the basic 

approach cannot be implemented in practice. 


2.3 Our proposal 

Our proposal aims at bypassing the arbitrary choice of the vector of reference strata £, while mim¬ 
icking the optimal version of the basic approach. We consider an overparametrization involving 
{K + l)p parameters, (31 = p* + 7 ^ for any k G [K]. It generalizes the decomposition used in 
the basic approach in the sense that no coefficient is constrained to be zero here. For nonnegative 
values of Ai and A 2 ,fc’s, our proposal returns 


(p, 7 i ...,%) G argmin 


\\yik)_xW{p + ^,)g 


2n 


+ Ai 


K 

1 + ^ A2,fe 

k=l 


( 2 ) 
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Working with a large enough value for \ 2 ^r, for some r G [K], is equivalent to constraining 7 ^ = Op, 
and then Jl = Pr, and reduces to the basic approach with i = (r,..., r). More generally, setting 
r = {ti, ... ,tk) with = A 2 ,fc/Ai, and defining the shrunk and r-weighted version of the median 
of ( 61 ,..., bx) as WSmedian( 6 i,..., bx] t) = argmin^(|5| + Ylik&[K] '’'k\bk — b\), it is easy to see 
that Itj G WSmedian(/3ij,..., fjx,j',T). In other words, for any particular value of the A 2 ,fc/Ai 
ratios, our approach encourages solutions (/3i = ^ + 71 ,..., (3x = V- + Ik) with a sparse vector 
Jl and sparse vectors of differences jk = Pk — and with the overall effect of the jth covariate juj 
defined as WSmedian(;0ij,..., I5x,j] t). 

Moreover, working with a large enough value for Ai is equivalent to constraining /I = Op, and 
our approach then reduces to K independent lasso’s. In contrast, working with large enough A 2 ,fc 
values is equivalent to constraining f3k = 'p- for all k, and our approach then reduces to one lasso run 
on all the strata pooled together. 


2.4 Rewriting as a lasso on a transformation of the original data 

Our proposal reduces to the lasso on a simple transformation of the original data, just as the basic 
approach does. Set '3^ = , • • • j the vector containing the n observations of the 

response variable. For any k G [K], introduce = {j ^ \p] ■ k ^ £j}, with £j still denoting the 
reference stratum chosen for covariate j in the basic approach. Note that J2k ~ ^)P 

and set Now introduce 


xm If'/n . 

. 0 \ 


/ xW 

xW/n ■ 

• " ^ 


’ 

A:’o = 




A 0 




0 

• Xi^^Tx J 


Criteria to be minimized in ([T]l and Q reduce to the lasso 

+ ( 3 ) 


with X set to either Xi, for the basic approach, or Xq, for our proposal, and 9 a vector of IR^^ or 
]P{^(^+i)p as appropriate. Therefore, our proposal comes at almost no additional computational cost 
compared to the basic approach. 

This rewriting as a lasso extends to generalized linear models, Cox models, etc., and makes our 
proposal directly implementable using the glmnet package of |Friedman et al. ( 2010| l, for instance. 
Considering logistic models, set £iogistic(y, z) = yiZi - log(l + e^“) for any y G {0, !}"■ and 

2 : G IR"^. The criterion of our proposal writes as the logistic lasso,— £iogistic(i^, XqO) + Ai ||0||i. In 
addition, the glmnet package can benefit from the sparse structure of Xq. 


2.5 Sparsistency 

From now on we assume that = mode(0, /3l , I3*x j) is uniquely defined for all j e b]> 

for ease of notation. For any vector of reference strata i = {£ 1 ,... ,ip) G [K]^, introduce the 
sets Si = {j G [p] : / 0} and Ti = {{k,j) G [K] x [p] : Pp- ^ P}. -}. Further define 

9*i = {pf,T2ril,...,Tx7fx)^ G with p* . = PI - and 7 ;, = (/3* - The 
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basic approach is sparsistent if it identifies these two sets, or, equivalently, the set = supp(0|), 
with high probability. Now, consider an optimal vector of reference strata t* = {£\,..., 1^) such 
thatfor all y E \p]. Define ti 7 qJ, ..., E where 

%k ~ ^k~ l^i* ■ "'ill proposal is sparsisfenf if if idenfifies Si* and Ti *, or, equivalenfly, 

fhe sef Jo = supp(0g), wifh high probabilify. 

For fhe lasso fo be sparsisfenf, a sufficienf and almosf necessary condifion on fhe design mafrix 
is fhe irrepresenfabilify condifion ( Zhao & Yu[ 2006| Wainwrightl 20091. Consider fhe general 
formulafion (|^ of fhe lasso and denofe by 9* fhe frue value of fhe paramefer vecfor fo be esfimafed. 
Defining J* = supp(0*), fhe mafrix X fulfills fhe irrepresenfabilify condifion, wifh respecf fo J*, 
if A m\„ (Xj,X.T*) > Cmin for some Cmin > 0 and max^^j. \\{X'j,Xj*)~^XjtXj\\i < 1. Using fhe 
vecfor of reference sfrafa I in fhe basic approach, fhe irrepresenfabilify condifion of mafrix Xi wrifes 
as {IC)i, while in our approach, fhe irrepresenfabilify condifion of mafrix Xq wrifes as {IC)q'. 


{IC)e Amin ) > Ci > 0 and ce = max 11 {Xij^Xu^) Xij^ 11 1 < 1, 

HJi 

(JC')o Amin(A’g^oA'oJo) > C'o > 0 and cq = max ||(A’g^oA:bjo)"^'^o^o'^Oilli < 1 


The comparison of {IC)i* and {IC)o is of particular inferesf. Firsf, because = ^oJo’ 1*^"® 

Ai* = AoJo and Amin(A’i^) = Amin(A:’o^o Abjo). Second, fhe maxima in fhe definitions 
of Co and ci* are faken over Jg and J^*, respectively, wifh | Jq| = p + | J|* |. Indeed, columns corre¬ 
sponding fo 7 q ■, for j E \p], are presenf in mafrix Xq while fhey are absenf from Xi*. Moreover, 
fhe corresponding indexes belong fo Jg since ^Qi* j = 0. Therefore, fhe only difference befween 

{IC)o and {IC)i* comes from fhe facf fhaf cq > Ci*, and {IC)o is only slighfly sfronger fhan 
{IC)i*. Focusing on fhe case of balanced sfrafa and orthogonal designs in each sfrafum. Lemma 
l^in Secfion 2.6 sfafes fhaf fhese fwo condifions are idenfical in fhis particular case. If furfher ex- 
plicifly relafes fhem fo fhe rafios Tk = A 2 ,fc/Ai and fo fhe maximum level of heferogeneify fhaf is 
allowed among fhe collections of values {Pi j • • •, Pk j)’ j ^ \p]- Iri addition, fhey are shown 
fo be generally weaker fhan {IC)i for fhe choice £= {r,... ,r), for some r E [K]. 

We can now sfafe Theorem [TJ according fo which our proposal idenfifies Si* and Ti* under 
nearly fhe same assumpfions as fhose required by fhe opfimal version of fhe basic approach if fhe 
were given in advance. More precisely, besides fhe facf fhaf (JC)o is generally a liffle sfronger 

fhaf {IC)i* , fhe only difference lies in fhe ferms ) and (A^°^, where K + 1 replaces 

K. Our resulf is a consequence of Theorem 1 in Wainwrighf (2009|). 


Theorem 1 Assume that the noise variables {£^^'^)i^\^nk\,k&[K] independent and identically dis- 
tributed centered sub-Gaussian variables with parameter a > 0. Further assume that | 

1 for all {k,j) E [K] x [p]. Introduce tq > 0 and set Tk = TQ{nk/np^‘^ for all k E [K]. If{IC)i* 
holds then set 71 = 1 — c^*. Further set 70 = 1 — cq (JC')o holds. For p E {0,1}, introduce 


2 < 


A 


(v) 


> 


7r,min(l,ro) 


2cr^ log{{K + p)p)) ] 

n j 


1/2 


o(v) _ \iv) 

Pmin — '^1 


(|s,.| + |r,.|)i/2 


Ct 


+ 4- 


cr 


a 


1/2 


If {IC)i* holds then solutions 9i* o/([^ with X = Xi* and Ai = A® as above are such that, 
with probability at least 1 — 4exp(—oonAi)/or constant oq > 0.' (i) 6i* is uniquely defined, 
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(m) = 0|jc^|, and (in) \\Oi*j^, 

.( 0 ) 

'^min 


— lamin' V> addition, \f5* 


> & for all 


j G Se* and \Pkj~t^j\ > l^mL/far all {k,j) G fa*, then Ji*, hence both Se* and fa*, are 
perfectly identified with probability at least 1 — 4exp(—aonAf). 

If {IC)o holds then solutions Oq of (3) with X = Xq and Ai = A^^^ as above are such that, with 
probability at least 1 — 4exp(—ainAi)/or some constant ai > O; (i) 9q is uniquely defined, (ii) 


6'oJg = 0|jc|, and (Hi) If in addition, \ j3*\ > far all j G S^* and 


( 1 ) 


j ~ ^ far all {k,j) G fa*, then Jq, hence both Se* and fa*, are perfectly identified 

with probability at least 1 — 4exp(—oinAf). 


This result especially confirms that it is harder to identify fa* than Se*, in the sense that hetero¬ 
geneities have to be at least ■ — /3*| > (n/nfc)^/^/3^j|j/ro for {k,j) G fa*, while \fi*\ has only 
(ri) 

to be greater than for j G Se* ■ 


2.6 The orthogonal and balanced case 

It is instructive to inspect in more detail the simple setting where rifc = n/K and X^^^)/nk = 

faf, for all A: G [K], This orthogonality assumption does not make matrices Xe and Xq orthogonal 
and is therefore not sufficient for the irrepresentability condition. 

For any vector of reference strata £ and all j G [p], define = {/c G [iT] : Kj = ^l,]- Set 
Vifi = maXj^ 5 ^ \{k G [K] : - j}| if Se [K] and 0 otherwise and Vep = max^gs^ \{k G 

[K] '■ fik j - j}l if ‘S'r / 0 and —cx) otherwise. For all k G [K], we set for some 

To > 0. 


Lemma 2 The matrix Xe fulfills the irrepresentability condition if and only if 


{sIC)e 


i^i/2 

0 < -- 

K - 2Vep 


< To < 




The matrix Xq fulfills the irrepresentability condition if and only if 


{sICfi 


0 < - 

K — 1 


< To < 


T3e*fl 


Conditions {sIC)(} and {sIC)e* in Lemma are identical. In the orthogonal and balanced 
case, the two sets of assumptions required by our proposal and the optimal version of the basic 
approach to identify the sets Se* and fa* are therefore identical, except for the terms (A^^\ 
and where K + 1 replaces K, as in Theorem ^above. 

In addition, {sIC)e*, or equivalently {sIC)o, imposes that 2Ve*,i + T’r*,o < K- In particular, 
if 'De*p = T)e*fl = T)e*, this implies that Ve* < K/3. The irrepresentability conditions {sIC)e* 
and {sIC)o are therefore directly related to the maximum level of heterogeneity among the values 
iPi j, • • •, Pk j)- Similarly, {IC)e imposes 2Vep + T)efl < K, which is generally a stronger con¬ 
straint. For simplicity, consider the situation where £ = (r, ..., r) for some r G [K], which is a 
common choice in practice. Without loss of generality, set r = 1. Then {IC)e entails that, for 
each j G [p], the effect of the jth covariate on most strata is while {sIC)e* and {sIC)o only 
entail that, for each j G [p], the effect of the jth covariate on most strata is mode(0, p ..., fix j)- 
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Moreover, if ^ = (1,..., 1) and {1} ^ ■ then we have Ti ^ Ti* and, possibly. Si ^ Si*: 

the identification of and Si is both less interesting and less likely and our proposal should be 
preferred over the basic approach. 

To recap, our non-asymptotic analysis shows that the partial description of the categorical effect 
modification due to Z through decompositions of the type = //* + 7 ^ is guaranteed only when 
the level of heterogeneity is not too high, that is when the number of nonzero components in vectors 
{lij, ■ ■ ■ t1*k j)^ 3 ^ *^00 high. In particular, the lowest level of heterogeneity is 

attained for the choice ^j,* = mode( 0 , PI j,..., Pk j)- proposal is able to target this optimal 
decomposition and is sparsistent under nearly the same assumptions as those required for the optimal 
version of the basic approach. 


2.7 Connection with the generalized fused lasso 

The fact that the level of heterogeneity must be low to ensure the sparsistency of our proposal as well 
as the sparsistency of the optimal version of the basic approach has connections with other results 
in the literature. Consider for instance the approach mentioned in Section[T] which was proposed by 
Gertheiss & Tutz ( 2012 i; see also Oelker et al.| ( |2014] l and |Viallon et al.| ( |2016[ ). This is based on a 
fusion penalty that encourages similarities among solutions Pk G IR^, k G [K], More precisely, for 
appropriate non-negative Ai and A 2 ’s, it returns estimators Pk defined as minimizers of fhe criferion 


E 

kelK] 


2n 


+ A: 


E 

kelK] 


1 + A 2 ^ 

ki<k2 


- I3k2 


(4) 


This criferion is fhaf of fhe generalized fused lasso, where fhe graph used in fhe penalfy is made of p 


cliques of size K (Viallon ef al.[ 2016 1 . The jfh clique corresponds fo fhe yfh predictor and connecfs 


all fhe componenfs in {Pij, ■ ■ ■ ,PK,j)' the K{K — l)/2 differences \Pki,j — Pk 2 ,j\^ for ki < /c 2 > 
appear in fhe penally term. 

Bolh fhe basic approach and our proposal are relaled to generalized fused lasso eslimafes loo. 
In particular, consider criferion ([T]l wifh fhe oplimal choice I* for fhe veclor of reference slrafa. If 
can be seen as a version of a generalized fused lasso, wifh a graph made of p slar-graphs of size K 
instead of p cliques: foreachj G [p], only fhe iT — 1 differences \Pk,j — Pe.,j\ appear in fhe penally 
lerm, for k ^ i* and i* G [K\ fixed. 

Non-asymplolic analyses of fhe sparsistency of generalized fused lasso eslimafes are scarce in 
fhe lileralure. [Sharpnack ef al. (|2012|l sludy Ihem in fhe normal means selling, which can be seen as 


a special case of fhe slralified linear regression considered here. Sharpnack el al. (20121 esfablish 
lhal generalized fused lasso eslimafes are sparsislenl only if fhe graph used in fhe fused penally is 
in good agreemenl wifh fhe Irue slruclure of fhe veclor of parameters; see also Qian & Jia| ( |2016 1 . 
Allhough if is nol slraighlforward lo extend lo our case, Ihese resulls suggesl lhal eslimafes derived 
from (0 can only be sparsislenl if fhe level of heferogeneily is nol too high. 

Our resulls precisely quantify fhe maximum level of helerogeneily above which a version of 
generalized fused lasso estimates, based on slar-graphs, can allain sparsistency in slralified regres¬ 
sion, in Ihe balanced and orlhogonal case. Because slar-graphs are less connected lhan cliques, il is 
likely lhal sparsistency for clique-based estimates, such as Ihose minimizing criterion (Q, requires 
an even lower maximum level of helerogeneily. Thai being said, sparsistency for clique-based es¬ 
timates refers to Ihe full identification of Ihe partitions {K!f \ ■ ■ ■ for j G [p]. For Ihe 
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basic approach, its optimal version and our proposal, sparsistency refers to the identification of one 
element of this partition only, say , and its complementary [K] \ ■ 

In the asymptotic regime, assuming that Kp is fixed and Uk/n —)• pk for some pk £ (0,1), 
for all /c G [K], oracle properfies have been derived for adapfive versions of clique-based esfimafes 
under mild assumpfions ( Gertheiss & Tufz[ 2012 1 : in particular, no assumption regarding fhe level 
of heferogeneify is required fo ensure perfecf recovery of fhe full parfifion , • • •, for 

all j G [p] . Similar resulfs are easily derived for adaptive versions of fhe basic approach for in- 
sfance. In view of Q, we can apply Theorem 2 of Zou ( |2006 1 fo show fhaf an adaptive version of 
fhe basic approach enjoys an oracle properly loo, wilhoul having fo assume any irrepresenlabilify 
condition. Here again, no assumpfion regarding fhe maximum level of heferogeneify is required, 
bul fhe idenlificafion of only one elemenl of fhe parfifion is guaranteed for all j. 

To recap, clique-based esfimafes are opfimal and should be preferred over our proposal or fhe 
basic approach in fhe asymplofic regime, assuming fhaf Kp is fixed and Uk/n —)• pk for some 
Pk G (0,1), for all k G [K]. In a non-asymplolic selling, resulfs for clique-based estimates are 
slill lacking, while conditions ensuring fhe sparsistency of our proposal and fhe basic approach are 
eslablished in fhe presenl arlicle. In fhe following simulation sludy, we especially compare fhe 
clique-based slrafegy and our proposal on finite samples. 


3. Simulation Study 

Theorem [^slates fhaf our proposal and fhe opfimal version of fhe basic approach perform similarly 
wilh regard fo fhe idenlificafion of Sg* and T^. for appropriate values of Ai and tq, under lechnical 
assumpfions on fhe design mafrices. The main objective of Ihis simulafion sludy is fo assess fhe 
empirical performance of our proposal under general designs, and for Ai and tq selected by 5-fold 
cross-validafion. Comparisons are made wilh fhe basic approach wifh fhe choice i = (1,...,1) as 
well as an opfimal choice i*. Clique-based esfimafes are also considered. 

We sel K = 20 and lake Uk G {10, 50,100} andp G {20,100, 500}. For each k G [K], rows of 
fhe design malrix are drawn from an N{0p, S) disfribulion, wilh S fhe {p x p) Toeplilz malrix 
wilh elemenl (i, j) equal fo 0.5l®“-^L We Ihen randomly selecl a subsel Po C \p] of size 20 and sel 
f^kj ~ 2 ^ Pq and k G [K], As for fhe values (/?! j, • • •, j) for j G Pq, we consider 

four levels of heferogeneify dn- More precisely, for any given du £ {1, 3, 6, 9}, we sel ■ = 1 
for k > dn, and = 1 -|- d^j for k < du for 10 indexes j randomly selected in Pq. For fhe 
olher 10 indexes in Po> we sel = 1 for k < du, and ■ = 1 + for k > dn- We furlher 
consider Iwo cases for fhe ■ values: Ihey are eilher conslanlly sel fo or drawn from fhe 
uniform disfribulion on [A'^/^/2, 2iT^/^] and Ihen multiplied by ±1 (wilh probabilily 1/2). When 
fhe j’s are consfanl, fhe collection of values {fdi j ,..., j) for each covariale j G Pq is made 
of Iwo groups of disfincl values, of sizes K — dn and du. This sifuafion should favor clique-based 
esfimafes. When <5^ ^ is random, fhe collecfion of values (/3{ ^,..., ld*p^ j ) for each covariale j G Pq 
is made ofdn + l groups, one of size K — dn, and fhe olher dn of size 1. This sifuafion should favor 
our proposal and fhe opfimal version of fhe basic approach. Observalions of fhe response variable 
are Ihen generated according fo wilh each componenf of drawn from 

an AA(0, cr^) disfribulion. The variance is sel fo YlkelK] giving overall signal- 

lo-noise ratio equal fo 1. For each particular choice of Uk G {10, 50,100}, p G {20,100, 500} 
and du £ {1,3, 6, 9}, and for bolh fhe random and consfanl choice for fhe ^’s, we generate 50 






replicates of data k G [K], Our results correspond to averages over these 50 replicates; 

see Figure [T] 

In all the configurations, = mode(0,/3J ‘,..., for all j G \p]. We then set = 

20 for all j G [p] for the optimal version of the basic approach. On the other hand, 
mode(0, /3J', ..., j) for all j G Pq. Under this setting, which is of course extreme, the com¬ 
parison between the results obtained using either £ or i* for the reference strata allows a precise 
description of the impact of the reference stratum on the performance of the basic approach. Top 
panels of Figure [^present results regarding the identification of the sets T’p„ = {(k,j) € \K] X F„ : 

7 ^ /31 ■} for the optimal version of the basic approach, our proposal and clique-based esti- 

ij j iJ 

mates, and that of the set G [K\ x Pq : f3l j basic approach. Here, 

we only consider covariates in Pq because they are those that are the most differently accounted for 
by the four approaches we compare. 

In the constant 5^ j case, our empirical results clearly illustrate our theoretical ones. First, our 
proposal performs nearly as well as the optimal version of the basic approach. Second, the lower 
dp, the better they perform, as expected since dp = Pi and Vq = 0 here. Third, it is more difficult 
to recover than Tp^, which is also expected since = K — dp > Pi- For the same 
reason however, as dp increases, p^ is easier to recover. A nice symmetry appears between the 
performance of our proposal and the optimal version of the basic approach on the one hand and the 
basic approach using £ = (1,..., 1), on the other hand. In addition, the recovery of the sets 
and is only marginally affected by p. Results in the random ■ case are mostly consistent 
with those in the constant case, except for the recovery of p^ which is mostly due to the fact that 
p^ = [iT — 1] X Pq irrespective of dp in this random case. Finally, the clique-based strategy 
performs similarly to, or a little worse than, our proposal with respect to this criterion. 

The bottom panels of Figure present results regarding log(X]A:e[i^l ~ 

( Dalalyan et al.[ 20171, which is a measure of the prediction error. Overall, our proposal and the 
optimal version of the basic approach perform similarly. They both outperform the basic approach, 
especially in the random <5^ ■ case. The clique-based strategy outperforms our proposal in the con¬ 
stant ■ case if dp is high enough, but only when the Uk/p ratio is not too small. In the random ■ 
case, our proposal clearly outperforms the clique-based strategy when p = 500, and more generally 
as dp increases and/or the rik/p ratio decreases. These results suggest that the clique-based strategy 
might not be able to fully account for, nor benefit from, the true structure in a high-dimensional set¬ 
ting. They also suggest that our proposal is better suited for this high-dimensional setting, as long 
as heterogeneity is not too high. 


4. Application on single-cell data 

We analyse data describing mielocytic leukemia cells undergoing differentiation to macrophage. 
Expression levels of 45 transcription factors are measured at iF = 8 distinct time points of this 
differentiation process (770, 771, 776, 7712, 7724, 7748, 7772 and 7796). Each time point defines a 


sfrafum where dafa on = 120 single cells are available. This dafa sef is described in Kouno 


ef al. (20131. In Ibis applicafion, fhe main objecfive is fo defermine how associations among fhe 


45 franscripfion faclors vary over time. Kouno el al. (20131 focus on marginal associalions and use 
univariafe analyses while graphical models, which describe conditional associalions, mighl be bef- 
fer suifed. Their inference can be reduced fo fhe idenlificafion and descriplion of fhe neighborhood 


of each covariale (Meinshausen Biihlmann 20061. Here, as a firsl sfep, we sludy how fhe neighbor- 
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Constant Constant Constant Constant Random Random Random Random 

diH = 1 dH = 3 dH = 6 diH = 9 cIh = 1 dH = 3 dH = 6 dH = 9 



Stratum Size nk 


Figure 1: Results from the simulation study. The top three panels show the aeeuraey regarding the 
reeovery of the set = {{k-,j) G [K] x Pq : j basie approaeh and 

the set Tp^ = {{k,j) £ [K] x Pq : ^ I3p for the other approaehes. The higher, 

the better. The bottom three panels illustrate the predietion error; the lower, the better. 
Results are presented for both the eonstant and random j eases. All results eorrespond 
to averages over 50 replieates in eaeh eonfiguration. Solid line: our proposal. Dotted 
line: optimal version of the basie approaeh. Dash-dot line: basie approaeh. Dashed line: 
elique-based approaeh. 


hood of one partieular transeription faetor, EGR2, varies over time. Towards this end, we eonsider 
stratified linear regression models that relates EGR2 to the other p = 44 faetors on the K = 8 
strata. Expression levels of EGR2 are eentered within eaeh stratum, and no intereept is ineluded 
in the models. Then, parameters of interest are veetors PI ,where G IR^^ deseribes the 
assoeiation between EGR2 and the p = 44 transeription faetors at the kth time point. We eompare 
estimates returned by our proposal and four eompetitors. The basie approaeh is eonsidered with 
two distinet ehoiees for the referenee stratum: we set it to either HO or H96 for eaeh eovariate. 
We further eonsider the elique-based strategy. Einally, given the ordinal nature of the strata in this 


10 































particular example, the variant based on chain graphs (Gertheiss & Tutz 20121 can be seen as the 
reference method. We include it as well, even if our main objective in this illustrative application 
is to compare the other four approaches, which do not account for this additional information. For 
each approach, regularization parameters are selected by 5-fold cross-validation. 


Basic approach, HO Basic approach, H96 Chain-based Ciique-based Our proposal 
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Figure 2: Estimation of the K = 8 parameter vectors in the linear regression models describ¬ 
ing the association between EGR2 and the p = 44 other transcription factors, at times 
HO, HI, HQ, H12, H24:, H48, H72 et H9Q. Each column corresponds to the estimation 
obtained according to one of the five considered approaches : the basic approach with 
the reference stratum set to either HO or H96 for every covariate, two versions of the 
generalized fused lasso estimates, one based on cliques and one based on chain graphs, 
and our proposal. 
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Results are presented in Figure For each method, the estimates correspond to a 44 x 8 
matrix of the form (/3i,... ,(5k) with K = 8 and j3k G IR'^^ for any k G [K], Therefore, for 
any j G \p], the yth row of each matrix corresponds to estimates {Pij ,..., f^Kj) of the effect of 
the jth covariate over the K time points, returned by the corresponding approach. This heat map 
representation allows an easy comparison of the pattern identified among the effects (/3ij ,..., 
of each covariate across the 8 strata. Our first objective is to illustrate the impact of the reference 
stratum when using the basic approach. Considering for instance the association between EGR2 
and MYB, most approach identify the same pattern: the association is constant between HI and 
H96, while it is lower, or even null, at HO. However, setting the reference stratum to HO, the basic 
approach does not detect any heterogeneity. These results are consistent with what is expected if 
the true pattern is the one identified by fhe ofher approaches: fhe basic approach used wifh HO as 
fhe reference sfrafum is unlikely fo identify fhe heferogeneify if if occurs af HO. Now consider fhe 
association wifh ELKl. Our proposal, fhe basic approach wifh HO as fhe reference sfrafum and 
fhe sfrafegy based on chain graphs all suggesf fhe absence of association befween EGR2 and EEKl 
on fhe fime inferval HO and H72, and a positive associafion af H90. If fhe reference sfrafum is 
sef fo H96, fhe basic approach suggesfs a quife differenf paffern, which is again expecfed if fhe 
frue paffern is fhe one refurned by fhe ofher approaches. These resulfs confirm fhaf fhe reference 
sfrafum is crifical for fhe basic approach. They further suggesf fhaf our proposal is able fo idenfify 
appropriafe covariafe-specific reference sfrafa. 

The comparison of fhe paffems refumed by our approach and fhe fwo fused lasso sfrafegies 
mosfly highlighfs fhaf fhe clique-based sfrafegy idenfifies fewer heferogeneifies and fhaf sfrafegy 
based on chain graphs refurns smoofher paffems. Again, fhese resulfs were expecfed given fhe 
connecfivify of fhe clique and chain graph, respecfively. Prediction error was evaluafed by double 
5-fold cross-validafion. Among fhe approaches fhaf do nof accounf for fhe ordering of fhe sfrafa, fhe 
besf predicfion error is obfained wifh our proposal, while fhe worsf is 1.8% higher and is obfained 
wifh fhe clique-based sfrafegy. The chain graph sfrafegy leads fo an improvemenf of 1.8% compared 
fo our approach. 

Two main conclusions can be drawn. When dafa come from several sfrafa of fhe populafion and 
no informafion is available regarding which sfrafa are likely fo share similar effecfs, our proposal is 
a compefifive approach. When addifional informafion is available, as in fhis particular application 
where sfrafa are nafurally ordered, accounting for if can be beneficial. 


5. Discussion 


Afler submiffing a firsf version of fhis work, we became aware of concurrenf work by Gross & 
TibshiranT| ( |2016 1 where fhe aufhors infroduce similar ideas. They apply if, in particular for fhe upliff 
problem in clinical research where fhe objecfive is fo find sub-populafions in a randomized frial for 
which an infervenfion is beneficial. In addifion, fhere has been a recenf line of works on penalized 
approaches aiming af identifying inferacfions, nof necessarily befween a cafegorical covariafe and 
ofher predictors ( Eim & Hasfie [ 2015 Radchenko & James] 20101. In fhis general confexf, sfrong 
hierarchy is often imposed: whenever an inferacfion befween fwo variables is included in fhe model, 
fhe corresponding main effecfs are included foo. However, fhis sfrong hierarchy is nof desirable 


in our selling, where a coefficienf can be nonzero in only one of fhe sfrafa (Gross & Tibshirani 


20161. Moreover, when applied in our setting, fhese approaches can be seen as versions of fhe basic 
approach Q based on exlensions of fhe Li-norm penally. In parlicular, a reference sfrafum has fo 
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be chosen as a first step, and the performance of these approaches would depend on this arbitrary 
choice. For instance, Radchenko & Jam^ (2010i establish assumptions under which their approach 
is sparsistent. When applied with the reference strata I, their main condition on the design matrix is 
very similar to {IC)^, which is generally stronger than {IC)^* and {IC)o. Then, these approaches 
could benefit from the ideas we developed in this article. 

Our proposal is based on an overparametrization, which naturally raises the question of iden- 
tifiability. We refer to Gross & Tibshirani ( 2016| l for some discussion. We shall add that there is 
no identifiability issue under the conditions of Theorem [T] If these conditions do not hold, and in 
particular if {IC)e* is not fulfilled, even the optimal version of the basic approach is not sparsistent, 
and the identifiability issue related to the overparametrization is secondary. 

Prediction bounds for our proposal can be derived under the conditions presented in this work. 
But weaker conditions might be sufficient, following recent work studying the lasso for correlated 
designs ( Dalalyan et al.[ 2017| ). Another extension might concern the derivation of valid p-values 
or confidence intervals for the nonzero parameters identified by our proposal. Given its connection 
with the lasso, this post-selection inference might be derived by extending recent strategies proposed 
for lasso estimates ( |Lee et alllpOlbl ). 

We also plan to extend our proposal to other regression models, which is straightforward for 
a variety of models given its connection with the lasso. In particular, our proposal could easily 
be extended to stratified Cox models used in survival analysis when competing risks arise ([Rosner 


et al. 20131, or to the conditional logistic models used in case-controls studies (Reid & Tibshiranil 


20141. The extension of clique-based estimates to other models is generally more computationally 


burdensome, partly because there is no proximal operator for the fused penalty. 
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Supplementary material 

Supplementary material available at Biometrika online includes two sections. Section 1 presents 
technical details: the proof of Lemma 1 and a generalized version of Lemma 1, the version of 
Theorem 1 in the balanced and orthogonal design along with its proof, and two corollaries describing 
the particular cases where Si* = 0 and Ti* = 0. Section 2 presents additional results from our 
empirical study: accuracies for the recovery of other sets of interest in the settings described heres, 
and additional results obtained under an alternative settings which should favor the clique-based 
strategy. 
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Appendix: Supplementary Files 

5.1 Technical details 

5.1.1 Proof OF Lemma 1 


Lemma 1 is established for matrix Aq; the proof for X£ follows from similar arguments and is 
omitted. 

Fix To > 0 and set Tk = for all k G [K]. Reeall that 9q = • • •, tkIq k^)'^^ 

with pf- = /?£* ■ for j G \p], and Jq = {j G [{K + l)p] : 9^- / 0}. For the sake of brevity, the 

proof is only presented in the ease where S^* / 0 and T^* / 0, where S^* = {j G \p\ : pg*j / 0} 

and T,. = {{k,j) : / pf^}. Setting T* = {j G \p] : {k,j) G T,.} and = 4^4"^ 

we have 


I xT - 

bi* n 


A’ojo — 


A 


y(i) 


{K) 

Si* 


X, 


(K) 
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i^OJo^OJo) = 


nl\ 


|S^* I 

'^Si* 

T1 

,T* 




XK 


y^(l) 


nl\ 


|T*| 


nil 


\T*\ 


y(K) 

^ c T’* 

K 

TR 


\ 


0 

0 


nl\ 


\T*\ 


For any s G [|5'£*|] and t G [IT^I], denote the sth element of S^* by S£*s and the fth element 
of by For all j G S^*, further denote by N* = n\K*\/K the number of observations 
in strata eontained in K* = {/c G [K] : = /Sp j}. Now introduee C, the matrix of size 

\T£* I X |S'£* I made of K bloeks Ck. Eaeh bloek is of size |T^| x 15^* |, and the element {t, s) of Ck 
is (C'k)(t,s) = -Tk/N*g^^^ if S(,*s = and 0 otherwise (s G [\Se*\] and t G [|T^|]). 
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For k,£ £ [K] and k ^ i, further introduce the matrix of size \ x | with element 

(fi,f 2 ) equal to ^fci ti ~ t 2 ~ some s G [|*S'£*|], and 0 otherwise. For 

k G [K], denote by the diagonal matrix of size \n\ X \n\ with fth diagonal term equal to 

'^ki^Se* + if t = for some s G [jS'*!] and r|/n^ otherwise. Finally denote 

by D the diagonal matrix of size |5£* | x |5£* | with jth diagonal term equal to 1/At*, and by B the 
matrix of size \Te*\ x \T£* \ made of blocks, with block {ki, ^ 2 ) equal to Bk^^k 2 - By standard 
algebra, we have 


(^oVoJo)"'= ^ B )■ 

Now, for any j ^ Jq, the jth column of Xq is of the form either (A) or (B): 

(A) Xoj = (0^,..., 0^, , 0 '^,..., 0^)^ for some ko G [K] and some jo ^ 

(B) Xoj = ,..., for some jo ^ Si*. 

If Xqj is of form (A), then XQj’"X qj^ = (cF, 0j^*|,..., Oj^^* |) G with d G IRI'^^*I and 

^ f tlfcg/Tfcg if jo = Si* Si 

^ 1 0 otherwise. 


Therefore, if X^j is of form (A), we have 

11 £^Oj^XoJo i^OJo^Jo ) ~^ I 


1 < max max —^(1 + ri). 
j&Si* k&K’^ TkN^ 


If Xqj is of form (B), then = (0|5^*|, di,..., d_fr)^ G with dk G IRIF'I and 


I nk/Tk ifjo = T/4, 
I 0 otherwise. 


In this case, we have 

\\^Oj^XoJo{^OJo'^^OJo)~^\\i < max ^ r^. 

Moreover, under the setting considered in Lemma I we have = n/AT for all fc G [K] and 
Tk = roA'“^/2_ Therefore, max^^j^j ||Aoj^Aojp(Aoj(,^Aojo)“^||i < 1 if and only if assumption 
{sIC)q holds, which completes the proof of Lemma 1 for matrix Xq. 


5.1.2 Version of Theorem 1 in the setting of orthogonal designs and balanced 

STRATA 

Theorem]^ below is the version of Theorem 1 in the setting considered in Lemma 1, where Uk = 
n/K and X^^^)/nk = In^ for all k G [K]. We set Vq = Vi*^ and T>i = Vi*, 1 ; see Section 

2.6 in the main text for the corresponding definitions. 


16 



Theorem 3 For all k G [K], assume that the noise variables i G [n^], are independent and 
identically distributed centered sub-Gaussian variables with parameter cr > 0. Further assuming 
that {sIC)o holds, define 


= Gl- iK-V,)r„ ) 


and 


Crain = mill ( 1,Tq ^ 




.-2 


For p G {0,1}, we set 


aS")> 


7min(l,ro) 


1/2 


AVi 


2cr^ log((iir + p)p) 


n 


— -r, 

’ \,k ~ ■ 


Finally introduce fi^^a ~ | + \Ti* \) , and consider the following /3-min 

conditions: 


(C„,.„ )(i) : Vi e s,; m > f(/t; {Cm )(m) : Vi e \p\yk i K', > 




' mill '^min / Q 

r/ien, S'£» and Tp are both recovered 

i^\2 

• with probability superior to 1 — 4exp(—cinA^ ), for some ci > 0, i/ie optimal version 
of the basic approach run with Ai = A^^^ and \ 2 ,k = A^*^^ under (C (o) )(*, ii) and we have 

’ ^min 

||^£* J(.* “ Iloo < 

f 1)^ 

• with probability superior to 1 — 4exp(—cinA^ ), for some ci > 0, owr approach run 
with Ai = a[^^ and \ 2 ,k = A^^^ under (C'oCi) )(g H) and we have ||0oJo “ ^ojq Iloo < /^mki- 

^ min 


Consider the asymptotic setting with K, and possibly p, tending to infinity as n —X oo. Further 
assume that Vp^ = Vp^i = Vp. If Vp <C or Vp = for some 0 < c < 1/2, then 

To = 1 ensures perfect recovery for signals such that = C>(n“^/^[(|5£* | + |T£*|) log((iC + 
l)p)]^/^), which is optimal up to log-terms. If Vp = cK^F for some c > 1/2, we get the same 
order of magnitude for /3^^, but with tq = (2c)< 1. If Vp, then the regime changes. 

For iFV2 <C Vp <C K the optimal choice is tq = j{2Vp) which only ensures perfect 
recovery for j3 ^}^ = 0 {{nK)~^Fj)^„ | + |Tr*|) log((iF -f l)p)]^/^). Finally, if Vp = cK for 

some 0 < c < 1/3, then the result of Theorem SI becomes almost meaningless: the optimal tq is 
0{K~^F'^ which only ensures perfect recovery for = 0(n“^/^[iT(|5£*| + |rr*|) log((iF + 
l)p)]V2). 


5.1.3 Proof OF Theorem [3] 

In view of Lemma 1, and because Xp = Xqj ^, we simply have to compute AminA/ q jg/n)) 
in order to apply Theorem 1 of Wainwright (|2009 1 and establish Theorem To do so, we look for 
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solutions of the characteristic polynomial of the matrix p(A) = det(A^(^^Afojo/n — 

AI| ju|). Using the block structure of matrix — A/|Jq|), we get the following expression 

p{x) = (1 - Ar(T-' - Ar n + 1 )+’ 

jes*n{Ufer*} A / 

withri = |5 '£*\{Ua:T^*}| andr 2 = | Jo|-|5£*|-|5'£*n{Ufcr^*}|. It follows that Amin Afo Jo/^)) > 

min ^ + 1) “ “ 1)^ + 42?ir“^/iir}^/^]). Denote by |||M|||oo the maximum 

row sum matrix norm of matrix M, and by |||M|||2 its spectral norm. Because ||| Aojo/n) |||oo < 

I Jo|^/^|||(A’(^^Aoj(,/n )|||2 < (IjS'r*! + D^/^/Cmin, Theorem[sjnow follows from Theorem 1 of 
|Wainwright| ( |2009| l. 


5.1.4 Generalization OF Lemma 1 


Here, we do not consider the orthogonal and balanced setting anymore and present general con¬ 
ditions ensuring that the irrepresentability conditions {IC)i and {IC)o are fulfilled by the design 
matrices and Aq involved in the basic approach and our proposal, respectively. For all k G [K], 
we assume that = To{nkln)^^'^ for some tq > 0 and that < 1 for all (j) G \p]. 

For X equal to either X^ or Xq, this ensures that n“^||A’j ||2 < max(l, t~^), for each column Xj of 
A”. 

For any given vector of reference strata i G [K]^ and any j G [p], set K£j = {k £ [K] ■ /3l.j = 
ji}, -} and = Kgj \ {(j}. Further set, for any k G [K], = {j G \p] : Plj + 1^1, j}’ = 


= X™ ,a„dZ« = (/„.-n«)x 

and af = Eu^xg'Ggt 


-(fc) ^—1 


rik) 


f\ Define = E,Jxg>"xf' 


Introduee the quantities Se - Lfce[A-] 4^^ and 


q(^) _ 


Finally set 


ci{l) = 


C2{i) = 


C2{i) = 


max 


A!{yl|i+ A , rk\ 


k&[K] 


E 

k£[K] 


1&[K] 


max max 
je[p] keKij 


max max 
j&[p] keK^j 


nw 



— + E 7 ll 4 nSlli + lldd + afHjh 

^ l^k 


y;^l|S!'''nglli + ll4'f!f+ fig| 


Lemma 4 Let (. G \KY be a given vector of reference strata. Assume that Amin(E£^fc) > 0/or 
k G [K] and AminlE^) > 0. Condition [ICf) holds if and only ifci{i) < 1 and C 2 {i) < 1. 

Assume that Amin(E^*^fc) > 0/or k G [K] and Amin(E£*) > 0. Condition {IC)q holds if and 
only if ci{C) < 1 and C 2 {C) < 1. 


The proof of Lemma|^follows from the same arguments as those presented in Section [5. 1.1 for the 
proof of Lemma 1, and is omitted. Again, the conditions ensuring that {IC)i* and {IC)o hold are 
very similar. This shows that our proposal is able to mimic the optimal version of the basic approach 
even when the designs are not orthogonal or strata are not balanced. 
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5.1.5 Other RESULTS 

Corollary 1^ and Corollary [^consider the two special cases where T^* = 0 and 5^* = 0, respectively. 

Corollary 5 Assume that the noise variables independent and identically cen¬ 
tered sub-Gaussian variables with parameter a > 0. Define X = ..., , the n x p 

matrix with all the strata pooled together. Set = To(nfc/n)^/^ > D for all k G [K], for some 
To > 0. For all j G \p\, assume that there exists some /5j G IR such that j = for all k G [K] 
and set Si* = {j G [p] : / 0}. Further assume that /n) — ^rain for some 

Cmin > 0, and that < 1 for all {k,j) G [K] x \p]. Finally assume that the three 

following conditions hold: 

(A) ^ Tfc > 1, 

k&{K] 

(C.i.l) Cl :=max||(X^^.X5,.)-'X^^.X,||i < 1, 

[C.ii) C2 := maxmaxT^-^IKX^ < 1. 

ie[p] k&[K] 


Now, set 7 = (1 — Cl) A (1 — C 2 ) and 

2 r2f72log((A: + l)p)\^/^ 


Ai — 


(1 A To)7 


n 


7 


/^min — Ai 


'\Si*\^l^ 


c„ 




Then our proposal run with parameters Ai and A 2 ,fc = t^Ai identifies Si* and Ti* = 0 with 
probability at least 1 — 4exp(—cinAf)/or ci > 0, as long as minjgs* |/3j| > /3min- 


Condition (C.Al) is exactly the irrepresentability condition on matrix X, while conditions {C.ii) 
and (A), which are very similar, both simply require that tq is high enough. Moreover, 7 = 1 — ci 
and 1A To = 1 for r high enough. Therefore, our proposal mimics the lasso run on (X, ‘3^) provided 
To high enough and is optimal, up to log-terms, when the /7^’s are all equal. 

Corollary 6 Assume that the noise variables {e[^'^)i^[nk],k&[K] ore independent and identically cen¬ 
tered sub-Gaussian variables with parameter u > 0. Set Tk = To(nfc/n)^/^ > 0/or all k G [K], 
for some tq > 0. For all j G [p], set Kj = {k £ [K] : ■ = 0} and for all k G [K], set 

= {j G [p] : ■ 7 ^ 0}. Assume that m.\\ik{N^i^{x':^} X^j:}/uk)} > Cminfarsome Cmin > 0, 

and that ni^^^'^\\Xj ^'^\\2 < lforall{k,j) G [K] x [p]. Further assume that the three following 
conditions hold: 

(A) Vj G [p], ^ < 1 + ^ Tk, 

e^K* k&K* 

(C.i.l) Cl := max max Tk\\{x!p} X^})~^X^} X-^^||i < 1, 

(C.ii) C 2 := max max x!^})~^X^} < 1. 

kelK]jifT* G' -'fe :> 
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Now, set 7 = (1 — Cl) A (1 — C 2 ), and 


Ai — 


(1 A To)7 1 


f 2o-^ log((i^ + l)p) ] 

n 1 


1/2 


/3min — Ai 


^min 


Then our proposal run with parameters Ai and A 2 ,fc = ^fcAi recovers Si* = 0 and Ti* = {(fc, j) : 
f^k j 0} with probability at least l—Aexp{—cin\\), for some Cl > 0, as long as > 

/3min(n./nfc)l/2. 


Condition (C.m) is exactly the union of the irrepresentability conditions for each matrix 
k G [K], while conditions (Ci.l) and A both simply require that tq is small enough. For tq 
small enough, our proposal then mimics the strategy consisting in performing K lasso on the data 
k = 1,... ,K, independently, with a common Ai value for each lasso. It is optimal, 
up to log-terms, when Si* = 0. 


5.2 Additional empirical results 

5.2.1 Under the designs considered in the main text 

Figure Isj presents additional results regarding the recovery of the set = {j G Pq : /^i j / 0} 
for the oasic approach run with reference strata I = (1,..., 1) and the recovery of the set S*p^ = 
{j e Po : / 0} for the other approaches. Overall, our proposal and the optimal version of 

the basic approach perform similarly according to this criterion too. In the constant <5^ ^ case, all 
methods perform similarly, and their performance does not depend on either por dji- In the random 
6^ j case, the performance of each method decreases as p and/or du increases. This discrepancy 
with the results obtained in the constant case illustrates that it is harder for the optimal version of 
the basic approach, our proposal and the clique-based strategy to determine whether the overall 
effect /3|* ■ of any covariate j is null when the collection of values (/?! j, • • •, f^*K j) varies around 
zero; keep in mind that in the random case, we have either = 1 or = 1 ± ■ with 

(5^ ■ ~ 12 2 x 1 / 2 ] so that ■ can be negative. Interestingly, it is harder for the basic approach 

to determine whether j is null in this situation too. In addition, the basic approach is generally 
outperformed by the other approaches in this random <5^ j case. The clique-based strategy performs 
well for dn > 6, especially when Uk/p is not too small. But it is outperformed by our proposal, 
and the optimal version of the basic approach, for dx = 3 and p = 500. 

5.2.2 Under AN ALTERNATIVE SCENARIO 

Here, we present additional empirical results obtained under a scenario which should favor the 
clique-based strategy. We still consider the case where K = 20 and Pq C [p], with \Pq\ = 20 but, 
for each j G [Pq], we set /3ij = • • • = = -a, Pej = ■■■ = jSioj = 0, /dnj = ■■■ = /3i5j = a 

and /3 i6j = • • • = /32o,j = 2a, for some a > 0. In other words, for each j G [Pq], the effects of 
the jth covariate across the 20 strata are made of 4 groups of distinct values, which should favor 
the clique-based strategy. Two values of a were considered, a = and a = Because 

results were very similar, only those obtained for a = jZ are presented here. Figure [^presents 
the predictive performance of each approach. We especially observe that our proposal performs 
nearly as well as the clique-based strategy in general, and outperforms it for p = 500. It also 
slightly outperforms the two versions of the basic approach. 
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Constant Constant Constant Constant Random Random Random Random 



Figure 3: Accuracy regarding the recovery the set = {j £ ^ /^i j / 0} for the basic 

approach and of the set Sp = {j G Pq : /3«* ■ 7 ^ 0} for the three other approaches. 

0 ij 

(Left): Constant <5^^ case. (Right): Random case. Results correspond to averages 
over 50 replicates in each configuration. Solid line: our proposal. Dotted line: optimal 
version of the basic approach. Dash-dot line: basic approach. Dashed line: clique-based 
approach. 


Figure [^presents the estimates returned by each approach for one particular simulation in the 
configuration = 100 and p = 100. It can especially be seen that the clique-based strategy does 
not fuse coefficients sensibly better than the other approaches. 
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100 p = 500 























Prediction error 


p = 20 p=100 p = 500 



Figure 4: Prediction error (the lower, the better). Results correspond to averages over 50 replicates 
in each configuration. Solid line: our proposal. Dotted line: basic approach with £ = 
(20,..., 20). Dash-dot line: basic approach with i = (!,...,!). Dashed line: clique- 
based approach. 
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True Our proposal Clique-based Basic approach (Wrong Rel Strata) Basic approach (Oracular Ref Strata) 



f 1 


Figure 5: Estimation of the K = 20 parameter vectors in one particular simulation with = 100 
and p = 100. The first column presents the true values. Each of the four remaining 
columns presents the estimates obtained according to one of the four considered ap¬ 
proaches: our proposal, the clique-based approach and the basic approach with the refer¬ 
ence stratum set to either 1 or 20 for every covariate. 
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